fix(runtime): refresh cloud API token in-process for long-running agents (cloud#2307 Half 2)#246
fix(runtime): refresh cloud API token in-process for long-running agents (cloud#2307 Half 2)#246khaliqgant wants to merge 1 commit into
Conversation
…nts (cloud#2307) The proactive sandbox's CLOUD_API_ACCESS_TOKEN is a relayfile-audience JWT minted once at launch with a ~2h TTL and no in-sandbox refresh. A long-running proactive agent that posts to Slack after the TTL lapses sends an expired-but-validly-signed bearer; the cloud verifier returns null and /api/v1/slack/post-message 401s wholesale. Add `ensureFreshCloudApiToken` + `startCloudApiTokenRefresher` and start the refresher in `startRunner` (the `node runner.mjs` subprocess where the relayflows Slack client is built and reads process.env.CLOUD_API_TOKEN). The refresher: - ticks under the 5-min skew, refreshing only within the window; - rotates-and-persists both CLOUD_API_ACCESS_TOKEN and CLOUD_API_TOKEN (the relayflows adapter env fallback) in place; - serializes a single in-flight refresh (relayauth refresh tokens are single-use; reuse cascade-revokes the session); - stops + signals on delegation-horizon (refresh 401) for out-of-band re-mint. Stacks on cloud#2313, which mints the token as a refreshable pair and injects CLOUD_API_REFRESH_TOKEN / CLOUD_API_ACCESS_TOKEN_EXPIRES_AT / CLOUD_API_REFRESH_URL. No-op when that refresh material is absent (local/dev). 10 new unit tests; full runtime suite 123/123, tsc --noEmit clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 8 minutes. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to keep the cloud API token fresh for the proactive sandbox runner by implementing an in-place token refresher and a background loop. The reviewer's feedback highlights several important improvements: performing an initial token refresh check immediately on startup to prevent using an expired token during the first 60 seconds, using a WeakMap keyed by the environment object to isolate in-flight refresh promises and prevent cross-contamination in concurrent or multi-tenant environments, and adding a timeout to the token refresh fetch request to avoid blocking indefinitely if the authentication server hangs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| import { buildCtx, type CtxBuildOptions } from './ctx.js'; | ||
| import { getTrajectoryRecorder, type TrajectoryRecorder } from './trajectory.js'; | ||
| import { isWorkforceHandler } from './handler.js'; | ||
| import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js'; |
There was a problem hiding this comment.
Import ensureFreshCloudApiToken and CloudApiTokenHorizonError to perform an initial token refresh check on startup.
| import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js'; | |
| import { ensureFreshCloudApiToken, startCloudApiTokenRefresher, CloudApiTokenHorizonError } from './cloud-api-token-refresh.js'; |
| const tokenRefresher = process.env.CLOUD_API_REFRESH_URL | ||
| ? startCloudApiTokenRefresher({ | ||
| onHorizonElapsed: (err) => | ||
| ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }), | ||
| onError: (err) => | ||
| ctx.log('warn', 'runner.cloud-api-token.refresh-failed', { | ||
| error: err instanceof Error ? err.message : String(err) | ||
| }) | ||
| }) | ||
| : null; |
There was a problem hiding this comment.
If the runner is cold-started and the token in the environment is already expired or near expiry, the background refresher (which ticks only after 60 seconds by default) will leave a 60-second window where the runner uses an expired token and fails with 401. Performing an initial refresh check immediately on startup ensures the token is fresh before any envelopes are processed. Additionally, guard both the initial refresh and the background refresher on both CLOUD_API_REFRESH_URL and CLOUD_API_REFRESH_TOKEN to avoid unnecessary work if the refresh token is missing.
if (process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN) {
try {
await ensureFreshCloudApiToken();
} catch (err) {
if (err instanceof CloudApiTokenHorizonError) {
ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message });
} else {
ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
error: err instanceof Error ? err.message : String(err)
});
}
}
}
const tokenRefresher = process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN
? startCloudApiTokenRefresher({
onHorizonElapsed: (err) =>
ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }),
onError: (err) =>
ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
error: err instanceof Error ? err.message : String(err)
})
})
: null;| // The sandbox holds one cloud API identity, so a process-wide guard is correct: | ||
| // concurrent posters await the SAME rotation rather than each spending the | ||
| // single-use refresh token (which would cascade-revoke the session). | ||
| let inFlightRefresh: Promise<string> | null = null; |
There was a problem hiding this comment.
Using a single global inFlightRefresh promise can cause cross-contamination if multiple runners or environments are executed in the same process (e.g., in concurrent test suites or multi-tenant environments). Using a WeakMap keyed by the env object isolates the in-flight refresh promises per environment.
| let inFlightRefresh: Promise<string> | null = null; | |
| let inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>(); |
| if (!inFlightRefresh) { | ||
| const fetchImpl = opts.fetchImpl ?? fetch; | ||
| inFlightRefresh = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => { | ||
| inFlightRefresh = null; | ||
| }); | ||
| } | ||
| return inFlightRefresh; |
There was a problem hiding this comment.
Update the in-flight refresh resolution to use the new inFlightRefreshes WeakMap to ensure proper isolation per environment.
let inFlight = inFlightRefreshes.get(env);
if (!inFlight) {
const fetchImpl = opts.fetchImpl ?? fetch;
inFlight = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => {
inFlightRefreshes.delete(env);
});
inFlightRefreshes.set(env, inFlight);
}
return inFlight;| const response = await fetchImpl(refreshUrl, { | ||
| method: 'POST', | ||
| headers: { 'content-type': 'application/json' }, | ||
| body: JSON.stringify({ refreshToken }) | ||
| }); |
There was a problem hiding this comment.
The fetch request to refresh the token does not have a timeout. If the auth server hangs or is extremely slow, this request will block indefinitely, preventing any concurrent or subsequent callers from proceeding. Adding a timeout using AbortSignal.timeout prevents this issue.
| const response = await fetchImpl(refreshUrl, { | |
| method: 'POST', | |
| headers: { 'content-type': 'application/json' }, | |
| body: JSON.stringify({ refreshToken }) | |
| }); | |
| const response = await fetchImpl(refreshUrl, { | |
| method: 'POST', | |
| headers: { 'content-type': 'application/json' }, | |
| body: JSON.stringify({ refreshToken }), | |
| signal: AbortSignal.timeout(15000) | |
| }); |
| export function resetInFlightRefreshForTests(): void { | ||
| inFlightRefresh = null; | ||
| } |
There was a problem hiding this comment.
Update the test helper to reset the inFlightRefreshes WeakMap instead of the single global variable.
| export function resetInFlightRefreshForTests(): void { | |
| inFlightRefresh = null; | |
| } | |
| export function resetInFlightRefreshForTests(): void { | |
| inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>(); | |
| } |
|
pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 967b8d6f74
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (accessToken && expiresAt) { | ||
| const expiresMs = Date.parse(expiresAt); | ||
| if (Number.isFinite(expiresMs) && expiresMs - now > skewMs) { | ||
| return accessToken; |
There was a problem hiding this comment.
Keep CLOUD_API_TOKEN in sync on fresh tokens
When the access token is still outside the skew window, this early return leaves CLOUD_API_TOKEN untouched. In runtimes where only CLOUD_API_ACCESS_TOKEN is injected, or where CLOUD_API_TOKEN still contains an older value, the runner's relayflows path that reads process.env.CLOUD_API_TOKEN continues to use a missing/stale bearer until the first refresh near expiry. Copy the fresh access token into the alias before returning, or include the alias in the freshness check.
Useful? React with 👍 / 👎.
| schedule(); | ||
| } | ||
|
|
||
| schedule(); |
There was a problem hiding this comment.
Refresh once before waiting for the first interval
Starting the refresher only schedules the first check after intervalMs, so if the runner process starts with a token already expired or inside the skew window (for example after restarting in a long-lived sandbox), startRunner can dispatch an envelope and post to Slack during this initial 60s gap with the old bearer. Kick off tick() immediately, or refresh before the dispatch loop begins, so startup state is covered too.
Useful? React with 👍 / 👎.
|
pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
|
Closing: placement disproven. The cloud#2307 401 is on |
|
pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
|
ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed. pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
|
ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed. I made no file edits — the PR is mechanically clean (build, lint, typecheck all pass) and the logic is sound, so there was nothing to auto-fix. Here is my review. Review: PR #246 —
|
|
pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
1 similar comment
|
pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
|
ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed. pr-reviewer could not complete review for #246 in AgentWorkforce/workforce. |
What
Half 2 of the cloud#2307 fix — the consumer that actually rotates the proactive sandbox's cloud API token before it expires. Stacks on cloud#2313 (Half 1, which mints the token as a refreshable pair + injects the refresh env vars).
Why
/api/v1/slack/post-messageis authenticated withCLOUD_API_ACCESS_TOKEN— a relayfile-audience JWT minted once at sandbox creation with a ~2h TTL and no in-sandbox refresh. Posts within 2h work; a long-running proactive agent that posts after the TTL (e.g. this incident, where the Slack post fired only after a multi-hour DB recovery) sends an expired-but-validly-signed bearer → verifier returns null → wholesale 401. Latent expiry bug, not a regression.How
packages/runtime/src/cloud-api-token-refresh.ts:ensureFreshCloudApiToken(env)— refreshes viaPOST ${CLOUD_API_REFRESH_URL}(relayauth/v1/tokens/refresh) only when the access token is within the 5-min skew of expiry; rotates-and-persists bothCLOUD_API_ACCESS_TOKENandCLOUD_API_TOKEN(the relayflows adapter env fallback) in place; serialized single-in-flight (relayauth refresh tokens are single-use — reuse cascade-revokes the session); throwsCloudApiTokenHorizonErroron a 401 (delegation horizon elapsed → out-of-band re-mint).startCloudApiTokenRefresher(opts)— background loop (setTimeout,unref'd), tick < skew.startRunner(runner.ts): the refresher runs inside thenode runner.mjssubprocess where the relayflows Slack client is constructed and readsprocess.env.CLOUD_API_TOKEN. (A parent-process refresher would NOT reach it — env is copied at spawn.) Guarded onCLOUD_API_REFRESH_URL, stopped in afinally.Placement note
The persona's Slack client (relayflows
SlackStepExecutor→new SlackClient) is built per step in the runner subprocess and resolves the tokenconfig.cloudApiToken ?? env.CLOUD_API_TOKEN. Keeping that subprocess'sprocess.envfresh means each per-step client picks up a live token — no relayflows change, no token-provider needed.Testing
tsc --noEmitclean.Sequencing
No-op without cloud#2313's injected refresh vars, so it's safe to merge in either order, but the 401 only fully clears once both land + deploy.
🤖 Generated with Claude Code